This notebook and the contents of this directory were copied from https://github.com/genepattern/single_cell_clustering_notebook.git and adapted for running on ScienceData.

Single-Cell RNA-seq Clustering Analysis Notebook

Authors - Clarence K. Mah, Alex T. Wenzel, and Edwin F. Juárez


Contact Email - ejuarez@ucsd.edu

This notebook is described in https://f1000research.com/articles/7-1306/v2

Summary

The goal of this notebook is to provide a standard single-cell RNA-seq analysis workflow for pre-processing, identifying sub-populations of cells by clustering, and exploring biomarkers to explain intra-population heterogeneity. The workflow is modeled after the Seurat Guided Clustering Tutorial[[1]](https://www.nature.com/articles/nbt.3192) and performs all analyses using the scanpy[[2]](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0) library.

Use Case

Single-cell RNA sequencing (scRNA-seq) has emerged as a popular method to profile gene expression at the resolution of individual cells. Compared to traditional RNA-seq collected from bulk cells or tissue, scRNA-seq enables users to capture cell-by-cell transcriptomic variability. This information can then be used to define and characterize heterogeneity within a population of cells, from identifying known cell types to discovering novel ones. Here we examine a peripheral blood mononuclear cells (PBMCs) dataset, which consists of many immune cell types including T-cells, B-cells, NK-cells, macrophages, and dendritic cells.

Dataset

This notebook analyzes a dataset of 3K Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor available from 10X Genomics, sequenced on the Illumina NextSeq 500.

  • Matrix Data File: 10X Genomics formatted single-cell RNA-seq count matrix of 2,700 single PBMCs

Analysis Overview

scRNAseq_flow_diagram.png

diagram source

Table of Contents

  1. Setup Analysis

    1. Load raw count matrix.
  2. Preprocess Counts

    1. Filter cells based on QC metrics.

    2. Perform log scaling and normalization to total counts.

    3. Remove unwanted sources of variation (number of detected molecules per cell as well as the percentage mitochondrial gene content).

    4. Detect highly variable genes.

    5. Perform linear dimensional reduction (PCA).

  3. Cluster Cells

    1. Cluster cells (graph-based clustering) in PCA space and visualize using t-SNE.
  4. Visualize Cluster Markers

    1. Explore and visualize cluster markers interactively.
  5. Export Analysis Data

    1. Export data to .csv files or a compressed .h5ad format.

Step 1: Setup Analysis

First set up the needed environment. We disable ssl certificate check in case you're gonna work with your data on ScienceData over the private network, i.e. connect to https://sciencedata/files/.

In [7]:
#%load_ext autoreload
#%autoreload 2

import os
import subprocess

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import requests
In [8]:
def download(url):
    fileName = os.path.basename(url)
    if fileName and not os.path.isfile(fileName):
        subprocess.check_output(['wget',url])
In [9]:
def upload(file, folderUrl):
    requests.put(file, folderUrl+'/'+file)
In [10]:
#scriptUrl = 'https://raw.githubusercontent.com/genepattern/single_cell_clustering_notebook/master/singlecell.py'
scriptUrl = 'https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/singlecell.py'
requirementsUrl = 'https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/requirements.txt'
In [11]:
download(scriptUrl)
download(requirementsUrl)
In [12]:
os.system("pip show louvain 2>/dev/null > /dev/null; if [ $? -ne 0 ]; then pip install setuptools==57.5.0; pip install -r requirements.txt; fi");
Out[12]:
0

Load raw count data for a single-cell RNA-seq experiment.

This notebook accepts either a single matrix file that includes gene and cell identifiers (as the first column and row respectively) or the three-file output from the 10X Genomics single-cell pipeline.

The single matrix file can be in the following formats: csv, txt, tsv, tab, data, h5, h5ad, loom

Instructions

  • If you have a single matrix file that includes gene and cell (barcode) labels, upload or paste the link to it in the first field below: "matrix data file"
  • If you have the three files from the 10X Genomics pipeline (matrix.mtx, genes.tsv, barcodes.tsv), upload them or provide a link to each file in the corresponding fields: 10x mtx data file, 10x gene name file, 10x barcodes file

  • NOTE: Do not use all four fields, either use the first one or the last three
    NOTE FOR DEVELOPERS: The cell below will download a copy of `singlecell.py` from GitHub to the present working directory. If you want to make modifications to that code, make sure also comment out the part of the code which downloads the file.

For data with 2,700 cells and 33,000 genes, this step will run in 10 to 20 seconds (depending on Internet speed)

Below you can either download data from 10x, unpacked pbmc3k_filtered_gene_bc_matrices.tar.gz and uploaded filtered_gene_bc_matrices to a non-public folder tmpon ScienceData; then uncomment the relevant line. Or you can leave as is and use provided example data.

In [13]:
# Uncomment this to use public example data from ScienceData
data_folder = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/data/"
# Uncomment this to use data you've downloaded to ScienceData from 10x
#data_folder = "https://sciencedata/files/tmp/"
# Uncomment this to use example data from the notebook authors
#data_folder = "https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/"

matrixDataFile = ""
MTXDataFile = data_folder+"matrix.mtx"
geneNameFile = data_folder+"genes.tsv"
barCodesFile = data_folder+"barcodes.tsv"
In [14]:
# Workaround for bug in GPNB 0.7.2

from IPython.display import display, Javascript

display(Javascript("""

setTimeout(function() {

    $(".gp-widget-call").each(function(i, e) {

        const widget = $(e).data("widget");

        widget.options.append_output = false;

    })

}, 1000);

"""))

import nbtools
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import sys

sys.path.append('.')
In [15]:
import genepattern
from singlecell import SingleCellAnalysis, sc
import logging
logging.captureWarnings(True)
In [16]:
sc.settings.verbosity = 0

logging.disable(logging.INFO)


%matplotlib inline


# Workaround for bug in GPNB 0.7.2

from IPython.display import display, Javascript

display(Javascript("""

setTimeout(function() {

    $(".gp-widget-call").each(function(i, e) {

        const widget = $(e).data("widget");

        widget.options.append_output = false;

    })

}, 1000);

"""))
At this point, save the notebook and reload the web page in order to have the javascript generate the UI elements.
In [17]:
analysis = SingleCellAnalysis()

genepattern.GPUIBuilder(

    analysis.setup_analysis,

    function_import='analysis.setup_analysis',

    name='Setup Analysis',

    parameters={

        'csv_filepath': {

            'name': 'matrix data file',

            'description': 'Provide your data file either as a URL or local file path. See above instructions for supported formats.',

            'default': matrixDataFile,

            'type': 'file',

            'kinds': ['csv', 'xlsx', 'txt', 'tsv', 'tab', 'data', 'h5', 'h5ad', 'soft.gz', 'txt.gz', 'anndata', 'mtx', 'zip'],

        },

        'gene_x_cell': {

            'name': 'Matrix dimension names',

            'description': 'If the rows of your matrix are genes, choose "gene by cell", otherwise choose "cell by gene"',

            'default': True,

            'type': 'choice',

            'choices': {

                "gene by cell": True,

                "cell by gene": False

            }

        },

        'mtx_filepath': {

            'name': '10x .mtx data file',

            'description': 'If you have data from 10X, upload or link your .mtx file here.',

            'default': MTXDataFile,

            'choices': {'2700 PBMCs from a Healthy Donor (example) - mtx': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/matrix.mtx'},

            'type': 'file',

            'kinds': ['mtx', 'zip']

        },

        'gene_filepath': {

            'name': '10x gene name file',

            'description': 'If you have data from 10X, upload or link your genes.tsv file here',

            'default': geneNameFile,

            'choices': {'2700 PBMCs from a Healthy Donor (example) - genes': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/genes.tsv'},

            'type': 'file',

            'kinds': ['tsv', 'zip']

        },

        'bc_filepath': {

            'name': '10x barcodes file',

            'description': 'If you have data from 10X, upload or link your barcodes.tsv file here',

            'default': barCodesFile,

            'choices': {'2700 PBMCs from a Healthy Donor (example) - barcodes': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/barcodes.tsv'},

            'type': 'file',

            'kinds': ['tsv', 'zip']

        },

        

        'output_var': {

            'hide': True

        }

    }

)

Step 2: Preprocess Counts

Perform cell quality control by filtering based on quality metrics, log normalizing counts, scaling by total counts, and correcting for effects of total counts per cell and the percentage of mitochondrial genes expressed. Finally, detect highly variable genes and perform linear dimensional reduction (PCA).

Instructions

  1. Use the quality metrics displayed in the output of **Step 1** to visually identify thresholds for filtering unwanted cells.


For 2,700 cells and 33,000 genes, this step will run in approximately 10 to 20 seconds

In [18]:
genepattern.GPUIBuilder(

    analysis.preprocess_counts,

    function_import='analysis.preprocess_counts',

    name='Preprocess Counts',

    parameters={

        'data': {

            'description': 'Output from the "Setup Analysis" tool.',

            'default': 'analysis'

        },

        

        'min_n_cells': {

            'name': 'filter genes (min. # of cells)',

            'description': 'Include genes expressed in at least this many cells. Blank will be treated as 0.',

            'type': 'number',

            'default': 3

        },

        

        'min_n_genes': {

            'name': 'filter cells (min. # of genes)',

            'description': 'Include cells with at least this many genes. Blank will be treated as 0.',

            'type': 'number',

            'default': 200

        },

        'max_n_genes': {

            'name': 'filter cells (max # of genes)',

            'description': 'Include cells with at most this many genes. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 1700

        },

        'min_n_counts': {

            'name': 'filter cells (min. total counts)',

            'description': 'Include cells with at least this many counts. Blank will be treated as 0.',

            'type': 'number',

            'default': 0

        },

        'max_n_counts': {

            'name': 'filter cells (max total counts)',

            'description': 'Include cells with at most this many counts. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 5700

        },

        'min_percent_mito': {

            'name': 'filter cells (min. % mito. genes)',

            'description': 'Include cells with at least this % of reads mapped to mitochondrial genes. Blank will be treated as 0.',

            'type': 'number',

            'default': 0

        },

        'max_percent_mito': {

            'name': 'filter cells (max % mito. genes)',

            'description': 'Include cells with at most this % of reads mapped to mitochondrial genes. Blank will be treated as no maximum value.',

            'type': 'number',

            'default': 6

        },

        'normalization_method': {

            'name': 'log normalize',

            'description': 'Perform log normalization on the data.',

            'choices': {'Yes': 'LogNormalize', 'No': ''}

        },

        'do_regression': {

            'name': 'perform regression',

            'description': 'Regress out effect of cell counts and percent mitochondrial genes',

            'choices': {'Yes': True, 'No': False}

        },

        'output_var': {

            'hide': True

        }

    })

Step 3: Cluster Cells

Cluster cells using the Louvain method for community detection on `N` principal components. Then use t-SNE to visualize cells, using `N` principal components.

It is important to note that the fewer PCs we choose to use, the less noise we have when clustering, but at the risk of excluding relevant biological variance. Look at the plot in Step 2 showing the percent variance explained by each principle component and choose a cutoff where there is a clear elbow in the graph. The N principal components to the left of this cutoff are then used to cluster the cells.

Instructions

  1. Perform clustering and visualization of the cells using the sliders to tune parameters.


For 2,600 cells and 10 principal components, this step will run in approximately 60 to 90 seconds
In [19]:
genepattern.GPUIBuilder(

    analysis.cluster_cells,

    function_import='analysis.cluster_cells',

    name='Open Cell Clustering Interface',

    description='This step outputs an interactive interface to cluster cells.',
    
    parameters={
        'output_var': {
            'hide': True
        }
    })

Step 4: Visualize Cluster Markers

Visualization the expression of markers on the clustering plot.

Explore genes that are differentially expressed in clusters as potential biomarkers. We provide the Wilcoxon rank-sum test (default) and t-test as test options. The non-parametric Wilcoxon rank-sum test generally performs better for single-cell data since single-cell counts are usually not normally distributed, which the t-test assumes. The Wilcoxon rank-sum test is also less sensitive to outliers.

The following are canonical markers that mark known cell types in this dataset. These can be used to identify what cell types are present and how they correspond to clusters.

| Cell Type | Markers |

| --------- | ------- |

| CD4 T Cells | IL7R |

| LYZCD14+ Monocytes | CD14, LYZ |

| B cells | MS4A1 |

| CD8 T cells | CD8A |

| FCGR3A+ Monocytes | FCGR3A, MS4A7 |

| NK cells | GNLY, NKG7 |

| Dendritic Cells | FCER1A, CST3 |

| Megakaryocytes | PPBP |

Step 4A: Build Cluster Marker Heatmap

Instructions

    Examine the heatmap to understand how distinct each cluster is compared to others. Modify parameters to show more top markers or change the statistical test.


For 2,600 cells and 2,000 variable genes, this step will take approximately 15 to 30 seconds
In [29]:
genepattern.GPUIBuilder(

    analysis.visualize_top_markers,

    function_import='analysis.visualize_top_markers',

    name='Show Cluster Marker Heatmap',

    description='This step outputs an interactive interface to explore gene expression in clusters of cells.',

    parameters={

        'output_var': {

            'hide': True

        }

    }

)

Step 4B: Visualize Marker Expression

Instructions

    Visualize the expression of gene(s) in each cell projected on the t-SNE map and the distribution across identified clusters. Provide any number of genes. If more than one gene is provided, the average expression of those genes will be shown.


For 2,600 cells and 2,000 variable genes, this step will take approximately 5 to 15 seconds
In [21]:
genepattern.GPUIBuilder(

    analysis.visualize_marker_expression,

    function_import='analysis.visualize_marker_expression',

    name='Visualize Marker Expression',

    description='This step outputs an interactive interface to explore gene expression in clusters of cells.',

    parameters={

        'output_var': {

            'hide': True

        }

    }

)

Step 4C: Compare Gene Expression Between Clusters

Instructions

    Use the dropdowns to select which clusters to compare as well as the statistical test to compare them with. Mouse over points on the volcano plot to identify each gene's log2 fold change between clusters and the resulting -log10 p-value.


For 2,600 cells and 2,000 variable genes, this step will take approximately 5 to 15 seconds
In [22]:
genepattern.GPUIBuilder(

    analysis.compare_clusters,

    function_import='analysis.compare_clusters',

    name='Compare Gene Expression Between Clusters',

    description='This step outputs an interactive volcano plot comparing genes between different clusters of cells.',

    parameters={

        'output_var': {

            'hide': True

        }

    }

)

Step 5: Export Analysis Data

Relevant Files

  • X.csv - The preprocessed expression matrix of cells (rows) by genes (columns). This only includes variable genes, usually a much smaller subset of genes compared to the raw counts.
  • obs.csv - Cell annotations including # of genes, % mitochondrial genes, and cluster assignments
  • obsm.csv - Coordinates of cells in various dimensional reduction spaces (e.g., PCA, t-SNE).
  • var.csv - Gene annotations (of variable genes) including the # of cells, mean expression, dispersion, and normalized dispersion statistics.
  • varm.csv - Loadings of cells in various dimensional reduction spaces (e.g., PCA, t-SNE).
  • uns/pca_variance_ratio.csv - % variance explained by each principal component.
  • uns/rank_genes_groups_gene_names.csv - Names of the top ranked marker genes for each cluster.
  • uns/rank_genes_groups_gene_scores.csv - z-scores of the top ranked marker genes for each cluster.

Instructions

Using the cell below, export results as a series of .csv files or compressed .h5ad file. The files will be created and saved to the chosen local data folder.

After creating, you may want to upload to a ScienceData folder.


For 2,600 cells and 2,000 variable genes, this step will take approximately 10-30 seconds
In [23]:
dataSavePath = 'data/analysis'
dataUploadFolderUrl = 'https://sciencedata/files/tmp/'
In [24]:
genepattern.GPUIBuilder(

    analysis.export_data,

    function_import='analysis.export_data',

    name='Export Analysis Data',

    description='Export data as a series of .csv files or compressed .hda5 file.',

    parameters={

        'path': {

            'name': 'filepath',

            'description': 'Name of directory where file(s) will be saved. Exporting as an h5ad file produces a single file output.',

            'default': dataSavePath

        },

        'h5ad': {

            'name': 'file format',

            'description': 'Choose to save either as .csv files or as a single compressed .h5ad-formatted hdf5 file. It is recommended to export .csv files unless you know what you are doing.',

            'default': False,

            'choices': {

                'Comma-separated values (.csv)': False,

                'H5AD file (HDF5 file in the AnnData formatting convention)': True

            },

        },

        'output_var': {

            'hide': True

        }

    }

)
In [34]:
# Evaluate to create a zip archive and upload it to ScienceData

dirName = os.path.basename(dataSavePath.rstrip('/'))
parentName = os.path.dirname(dataSavePath.rstrip('/'))
os.system("cd "+parentName+"; if [ -d '"+dirName+"' ]; then zip -r "+dirName+".zip "+dirName+"; fi; curl --insecure --upload "+dirName+".zip "+dataUploadFolderUrl.rstrip('/')+'/'+dirName+".zip")
updating: analysis/ (stored 0%)
updating: analysis/uns/ (stored 0%)
updating: analysis/uns/louvain.csv (deflated 4%)
updating: analysis/uns/rank_genes_groups.csv (deflated 65%)
updating: analysis/uns/neighbors.csv (deflated 4%)
updating: analysis/uns/pca.csv (deflated 58%)
updating: analysis/var.csv (deflated 49%)
updating: analysis/obs.csv (deflated 63%)
updating: analysis/varm.csv (deflated 61%)
updating: analysis/X.csv (deflated 89%)
updating: analysis/obsm.csv (deflated 55%)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6732k    0     0  100 6732k      0  10.4M --:--:-- --:--:-- --:--:-- 10.4M
Out[34]:
0

Results Interpretation

This analysis employs a scRNA-seq gene expression dataset for 2700 peripheral blood mononuclear cells (PBMCs) from a healthy donor as a demonstration of its use. We are able to recapitulate cell types; the clusters can be characterized by visualizing the expression of canonical markers of these cell types on the 2D t-SNE projection plot as done in Step 3 and Step 4.

In Step 4, the visualization of cluster marker expression helps us understand why some cells we identify visually as a single cluster are divided into multiple. For example, LYZ is overexpressed in a cloud of samples that clustering separates as two distinct clusters. The LYZ gene encodes for human lysozyme, an antimicrobial agent associated with blood monocytes. Using the cluster comparison tool, we can see that one of the subclusters exhibits high relative expression of CD14 while the other exhibits high relative expression of FCGR3A, also known as the CD16 receptor gene. These two genes characterize two known subtypes of blood monocytes respectively; classical and non-classical monocytes.

References

  1. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F., & Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33, 495. http://dx.doi.org/10.1038/nbt.3192
  1. Wolf, F. A., Angerer, P., & Theis, F. J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 15. https://doi.org/10.1186/s13059-017-1382-0